EDA - Exploratory Data Analysis!

Analysis of the data in the file student_lifestyle_dataset.csv, downloaded from Kaggle.

This dataset, titled "Daily Lifestyle and Academic Performance of Students", contains data from 2,000 students collected via a Google Form survey. It includes information on study hours, extracurricular activities, sleep, socializing, physical activity, stress levels, and CGPA. The data covers an academic year from August 2023 to May 2024 and reflects student lifestyles primarily from India. This dataset can help analyze the impact of daily habits on academic performance and student well-being.

  • File Format: CSV
  • File Name: Daily_Lifestyle_and_Academic_Performance.csv
  • Number of Records: 2000 rows
  • Number of Columns: 8 columns
  • Column Names: Student ID, Study Hours, Extracurricular Hours, Sleep Hours, Social Hours, Physical Activity Hours, Stress Level, GPA
  • File Size: Approximately 150 KB

Student project.

In [5]:
# import of necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
In [3]:
# defining student_df variable and reading the csv file into the DataFrame
student_df = pd.read_csv('student_lifestyle_dataset.csv', sep=",")
In [4]:
student_df
Out[4]:
Student_ID Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day Social_Hours_Per_Day Physical_Activity_Hours_Per_Day GPA Stress_Level
0 1 6.9 3.8 8.7 2.8 1.8 2.99 Moderate
1 2 5.3 3.5 8.0 4.2 3.0 2.75 Low
2 3 5.1 3.9 9.2 1.2 4.6 2.67 Low
3 4 6.5 2.1 7.2 1.7 6.5 2.88 Moderate
4 5 8.1 0.6 6.5 2.2 6.6 3.51 High
... ... ... ... ... ... ... ... ...
1995 1996 6.5 0.2 7.4 2.1 7.8 3.32 Moderate
1996 1997 6.3 2.8 8.8 1.5 4.6 2.65 Moderate
1997 1998 6.2 0.0 6.2 0.8 10.8 3.14 Moderate
1998 1999 8.1 0.7 7.6 3.5 4.1 3.04 High
1999 2000 9.0 1.7 7.3 3.1 2.9 3.58 High

2000 rows × 8 columns

General overview of the data.

In [6]:
student_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Student_ID                       2000 non-null   int64  
 1   Study_Hours_Per_Day              2000 non-null   float64
 2   Extracurricular_Hours_Per_Day    2000 non-null   float64
 3   Sleep_Hours_Per_Day              2000 non-null   float64
 4   Social_Hours_Per_Day             2000 non-null   float64
 5   Physical_Activity_Hours_Per_Day  2000 non-null   float64
 6   GPA                              2000 non-null   float64
 7   Stress_Level                     2000 non-null   object 
dtypes: float64(6), int64(1), object(1)
memory usage: 125.1+ KB
In [7]:
student_df.head() # displaying initial values
Out[7]:
Student_ID Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day Social_Hours_Per_Day Physical_Activity_Hours_Per_Day GPA Stress_Level
0 1 6.9 3.8 8.7 2.8 1.8 2.99 Moderate
1 2 5.3 3.5 8.0 4.2 3.0 2.75 Low
2 3 5.1 3.9 9.2 1.2 4.6 2.67 Low
3 4 6.5 2.1 7.2 1.7 6.5 2.88 Moderate
4 5 8.1 0.6 6.5 2.2 6.6 3.51 High
In [8]:
student_df.tail() # displaying final values
Out[8]:
Student_ID Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day Social_Hours_Per_Day Physical_Activity_Hours_Per_Day GPA Stress_Level
1995 1996 6.5 0.2 7.4 2.1 7.8 3.32 Moderate
1996 1997 6.3 2.8 8.8 1.5 4.6 2.65 Moderate
1997 1998 6.2 0.0 6.2 0.8 10.8 3.14 Moderate
1998 1999 8.1 0.7 7.6 3.5 4.1 3.04 High
1999 2000 9.0 1.7 7.3 3.1 2.9 3.58 High
In [9]:
student_df.sample(15) # displaying 15 random records
Out[9]:
Student_ID Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day Social_Hours_Per_Day Physical_Activity_Hours_Per_Day GPA Stress_Level
700 701 6.7 2.0 8.1 5.5 1.7 3.11 Moderate
1860 1861 6.2 4.0 6.0 3.5 4.3 2.88 Moderate
1650 1651 6.3 3.2 9.7 0.7 4.1 2.65 Moderate
224 225 8.9 0.4 8.8 3.2 2.7 3.32 High
1115 1116 6.1 2.9 6.6 3.0 5.4 3.01 Moderate
848 849 10.0 1.0 9.4 3.0 0.6 3.40 High
6 7 8.0 0.7 5.3 5.7 4.3 3.08 High
1984 1985 8.5 0.3 7.1 3.4 4.7 3.23 High
1870 1871 5.6 3.3 6.3 4.7 4.1 2.85 Low
861 862 8.7 0.1 7.9 5.6 1.7 3.30 High
1024 1025 7.6 4.0 8.1 1.6 2.7 3.02 Moderate
466 467 7.2 1.2 6.3 1.2 8.1 3.05 Moderate
1717 1718 7.4 1.2 7.6 2.3 5.5 3.03 Moderate
153 154 7.1 2.9 9.7 4.0 0.3 3.33 Moderate
1877 1878 8.0 2.0 8.2 5.6 0.2 3.40 Moderate
In [10]:
student_df.describe() # displaying statistics for numeric columns
Out[10]:
Student_ID Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day Social_Hours_Per_Day Physical_Activity_Hours_Per_Day GPA
count 2000.000000 2000.000000 2000.000000 2000.000000 2000.000000 2000.00000 2000.000000
mean 1000.500000 7.475800 1.990100 7.501250 2.704550 4.32830 3.115960
std 577.494589 1.423888 1.155855 1.460949 1.688514 2.51411 0.298674
min 1.000000 5.000000 0.000000 5.000000 0.000000 0.00000 2.240000
25% 500.750000 6.300000 1.000000 6.200000 1.200000 2.40000 2.900000
50% 1000.500000 7.400000 2.000000 7.500000 2.600000 4.10000 3.110000
75% 1500.250000 8.700000 3.000000 8.800000 4.100000 6.10000 3.330000
max 2000.000000 10.000000 4.000000 10.000000 6.000000 13.00000 4.000000

Preliminary observations.

According to the analyzed data set, students study about 7.5 hours a day on average, with a maximum of 10 hours.

The maximum time allocated to extracurricular activities is 4 hours per day.

Students sleep about 7.5 hours on average, with the shortest sleep time being 5 hours and the longest 10 hours. Social time averages about 2.7 hours a day.

They spend an average of about 4.3 hours a day on physical activity, with a maximum of 13 hours.

The highest GPA is 4.00 and the lowest is 2.24.
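The figures above can be read off `describe()`, but a compact per-column summary makes them easier to quote. The sketch below is one possible way to do that; the helper name `summarize` is hypothetical, and the sample frame holds only the five rows shown by `student_df.head()`, so its numbers differ from the full-data statistics.

```python
import pandas as pd

# First five rows as displayed by student_df.head() in this notebook.
sample_df = pd.DataFrame({
    "Study_Hours_Per_Day": [6.9, 5.3, 5.1, 6.5, 8.1],
    "Sleep_Hours_Per_Day": [8.7, 8.0, 9.2, 7.2, 6.5],
    "GPA": [2.99, 2.75, 2.67, 2.88, 3.51],
})


def summarize(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return mean, min and max per column, rounded to two decimals.

    Hypothetical helper: one row per column, one column per statistic.
    """
    return df[columns].agg(["mean", "min", "max"]).round(2).T


summary = summarize(sample_df, ["Study_Hours_Per_Day", "Sleep_Hours_Per_Day", "GPA"])
print(summary)
```

Calling `summarize(student_df, [...])` on the full DataFrame should reproduce the mean, min and max rows of the `describe()` table.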

Missing value analysis.

In [11]:
student_df.isnull().sum()
Out[11]:
Student_ID                         0
Study_Hours_Per_Day                0
Extracurricular_Hours_Per_Day      0
Sleep_Hours_Per_Day                0
Social_Hours_Per_Day               0
Physical_Activity_Hours_Per_Day    0
GPA                                0
Stress_Level                       0
dtype: int64
In [12]:
student_df[student_df.duplicated()] # displaying duplicates
Out[12]:
Student_ID Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day Social_Hours_Per_Day Physical_Activity_Hours_Per_Day GPA Stress_Level

The analyzed data set has no missing values and no duplicate rows.
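The two checks above can be bundled into a single report. This is a minimal sketch, assuming a helper named `quality_report` (hypothetical) and adding a check for repeated `Student_ID` values, which `duplicated()` on whole rows would miss if two rows shared an ID but differed elsewhere. The demo frame is made up and deliberately contains one gap and one duplicate.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, id_column: str = "Student_ID") -> dict:
    """Count missing cells, fully duplicated rows and repeated IDs."""
    return {
        "missing_cells": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_ids": int(df[id_column].duplicated().sum()),
    }


# Made-up demo data (not from the dataset): one missing GPA, one repeated row.
demo_df = pd.DataFrame({
    "Student_ID": [1, 2, 2, 3],
    "GPA": [3.0, 2.5, 2.5, None],
})

report = quality_report(demo_df)
print(report)
```

For `student_df`, all three counts would be expected to be zero if the observation above holds.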

Single variable analysis.

In [13]:
# distribution of study hours, split by stress level
sns.displot(data=student_df, x="Study_Hours_Per_Day", col="Stress_Level", kde=True)
Out[13]:
<seaborn.axisgrid.FacetGrid at 0x207e41b5550>
In [14]:
sns.displot(data=student_df, x="Extracurricular_Hours_Per_Day", col = "Stress_Level", kde=True) 
Out[14]:
<seaborn.axisgrid.FacetGrid at 0x207ec8acc50>
In [15]:
sns.displot(data=student_df, x="Sleep_Hours_Per_Day", col = "Stress_Level", kde=True) 
Out[15]:
<seaborn.axisgrid.FacetGrid at 0x207ec4fb1d0>
In [16]:
sns.displot(data=student_df, x="Social_Hours_Per_Day", col = "Stress_Level", kde=True) 
Out[16]:
<seaborn.axisgrid.FacetGrid at 0x207ed2b3490>
In [17]:
sns.displot(data=student_df, x="Physical_Activity_Hours_Per_Day", col = "Stress_Level", kde=True) 
Out[17]:
<seaborn.axisgrid.FacetGrid at 0x207ef064bd0>
In [18]:
sns.displot(data=student_df, x="GPA", col = "Stress_Level", kde=True) 
Out[18]:
<seaborn.axisgrid.FacetGrid at 0x207ee8158d0>
In [19]:
student_df['Stress_Level'].value_counts()
Out[19]:
High        1029
Moderate     674
Low          297
Name: Stress_Level, dtype: int64
In [20]:
student_df[student_df['GPA'] == student_df['GPA'].max()] # displaying the student with the highest GPA
Out[20]:
Student_ID Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day Social_Hours_Per_Day Physical_Activity_Hours_Per_Day GPA Stress_Level
51 52 9.0 2.6 8.5 3.1 0.8 4.0 High
In [21]:
student_df[student_df['GPA'] == student_df['GPA'].min()] # displaying the student with the lowest GPA
Out[21]:
Student_ID Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day Social_Hours_Per_Day Physical_Activity_Hours_Per_Day GPA Stress_Level
764 765 5.5 1.8 6.7 5.2 4.8 2.24 Low

Short observations.

Just over half of the surveyed students (1,029 of 2,000) report a high level of stress. Low stress occurs in only 297 of the 2,000 students analyzed.
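To express the group sizes as shares, the counts printed by `value_counts()` can be converted to percentages. A minimal sketch using only those three counts:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"High": 1029, "Moderate": 674, "Low": 297}, name="Stress_Level")

# Share of each stress level in percent of all students.
shares = (counts / counts.sum() * 100).round(2)
print(shares)
```

On the DataFrame itself, `student_df['Stress_Level'].value_counts(normalize=True)` gives the same proportions directly.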

Low-stress students study about 5 to 6 hours a day. In the moderate-stress group, study time is more than 5 but less than 9 hours.

In the high-stress group the range is wider, from 5 to 10 hours a day, although most of these students study about 9 hours.

Sleep of less than 6 hours is observed only in the high-stress group. These short sleepers are still a minority of that group: the first quartile of sleep for the whole sample is 6.2 hours, so at most about 500 students sleep less than 6 hours, which is fewer than half of the 1,029 high-stress students.
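The ranges quoted in these observations are read from the histograms. A grouped summary would state them numerically. The sketch below assumes a hypothetical helper `range_by_stress` and, to stay self-contained, uses only the five rows shown by `head()`, so its output illustrates the call rather than the full-data ranges.

```python
import pandas as pd

# First five rows as displayed by student_df.head() in this notebook.
sample_df = pd.DataFrame({
    "Study_Hours_Per_Day": [6.9, 5.3, 5.1, 6.5, 8.1],
    "Sleep_Hours_Per_Day": [8.7, 8.0, 9.2, 7.2, 6.5],
    "Stress_Level": ["Moderate", "Low", "Low", "Moderate", "High"],
})


def range_by_stress(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Min, median and max of one column within each stress level."""
    return df.groupby("Stress_Level")[column].agg(["min", "median", "max"])


study_ranges = range_by_stress(sample_df, "Study_Hours_Per_Day")
sleep_ranges = range_by_stress(sample_df, "Sleep_Hours_Per_Day")
print(study_ranges)
print(sleep_ranges)
```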

Analysis of relationships between variables.

In [22]:
plt.scatter('Study_Hours_Per_Day', 'GPA',  data=student_df)
plt.xlabel('Study_Hours_Per_Day')
plt.ylabel('GPA')
plt.title('Relationship between study hours and grade point average.')
plt.show()

The longer the time spent studying, the higher the average grade tends to be.
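A scatter plot shows the direction of the trend; a correlation coefficient and a fitted slope put numbers on it. The following is a sketch under stated assumptions: `linear_trend` is a hypothetical helper, and the five `head()` rows stand in for the data, so the resulting values say nothing reliable about the full sample.

```python
import numpy as np
import pandas as pd

# First five rows as displayed by student_df.head() in this notebook.
sample_df = pd.DataFrame({
    "Study_Hours_Per_Day": [6.9, 5.3, 5.1, 6.5, 8.1],
    "GPA": [2.99, 2.75, 2.67, 2.88, 3.51],
})


def linear_trend(df: pd.DataFrame, x: str, y: str) -> dict:
    """Pearson correlation plus least-squares slope and intercept of y on x."""
    r = df[x].corr(df[y])
    slope, intercept = np.polyfit(df[x], df[y], deg=1)
    return {"pearson_r": float(r), "slope": float(slope), "intercept": float(intercept)}


trend = linear_trend(sample_df, "Study_Hours_Per_Day", "GPA")
print(trend)
```

The slope reads as the expected change in GPA per additional study hour per day, under the assumption that the relationship is roughly linear.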

In [23]:
sns.relplot(
    data=student_df,
    x="GPA",
    y="Study_Hours_Per_Day",
    col="Stress_Level",
    hue="Stress_Level",
)
Out[23]:
<seaborn.axisgrid.FacetGrid at 0x207efaa09d0>
In [24]:
sns.lmplot(data=student_df, x="GPA", y="Study_Hours_Per_Day", col="Stress_Level", hue="Stress_Level")
Out[24]:
<seaborn.axisgrid.FacetGrid at 0x207f1e49b10>
In [25]:
sns.jointplot(data=student_df, x="Study_Hours_Per_Day", y="GPA", hue="Stress_Level")
Out[25]:
<seaborn.axisgrid.JointGrid at 0x207ec773750>
In [26]:
fig = px.scatter(student_df, x="GPA", y="Extracurricular_Hours_Per_Day")
fig.update_layout(
    title="Relationship between extracurricular activities and grade point average",
    xaxis_title="GPA",
    yaxis_title="Extracurricular_Hours_Per_Day",
)
fig.show()
In [27]:
plt.scatter('Sleep_Hours_Per_Day', 'GPA',  data=student_df)
plt.xlabel('Sleep_Hours_Per_Day')
plt.ylabel('GPA')
plt.title('Relationship between sleep hours and grade point average.')
plt.show()
In [28]:
sns.relplot(
    data=student_df,
    x="GPA",
    y="Sleep_Hours_Per_Day",
    col="Stress_Level",
    hue="Stress_Level",
    
)
Out[28]:
<seaborn.axisgrid.FacetGrid at 0x207f4bfe590>

In the high-stress group, low average grades go hand in hand with short sleep time.

In [29]:
sns.relplot(
    data=student_df,
    x="Sleep_Hours_Per_Day",
    y="Study_Hours_Per_Day",
    col="Stress_Level",
    hue="GPA",
    size="GPA",
)
Out[29]:
<seaborn.axisgrid.FacetGrid at 0x207f4c103d0>

In the high-stress group, average grades increase with more time spent studying and sleeping.

For low-stress students, GPA shows no apparent change with more sleep.
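Whether sleep and GPA move together within each stress group can be checked with a per-group correlation. This is a minimal sketch: `corr_by_group` is a hypothetical helper, and the values in `toy_df` are made up purely to show one group with a clear trend and one without.

```python
import pandas as pd

# Toy values, made up for illustration only (not taken from the dataset).
toy_df = pd.DataFrame({
    "Stress_Level": ["High"] * 4 + ["Low"] * 4,
    "Sleep_Hours_Per_Day": [5.0, 6.0, 7.0, 8.0, 6.0, 7.0, 8.0, 9.0],
    "GPA": [2.8, 3.0, 3.2, 3.4, 3.0, 2.9, 3.1, 3.0],
})


def corr_by_group(df: pd.DataFrame, x: str, y: str, group: str = "Stress_Level") -> pd.Series:
    """Pearson correlation between x and y, computed separately per group."""
    return pd.Series({name: part[x].corr(part[y]) for name, part in df.groupby(group)})


sleep_gpa_corr = corr_by_group(toy_df, "Sleep_Hours_Per_Day", "GPA")
print(sleep_gpa_corr)
```

Applied to `student_df`, a value near zero for the low-stress group and a clearly positive value for the high-stress group would be consistent with the observations above.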

In [30]:
plt.scatter('Social_Hours_Per_Day', 'GPA',  data=student_df)
plt.xlabel('Social_Hours_Per_Day')
plt.ylabel('GPA')
plt.title('Relationship between social activity hours and grade point average.')
plt.show()
In [31]:
sns.relplot(
    data=student_df,
    x="Study_Hours_Per_Day",
    y="Social_Hours_Per_Day",
    col="Stress_Level",
    hue="GPA",
    #size="Stress_Level",
)
Out[31]:
<seaborn.axisgrid.FacetGrid at 0x207f597b750>
In [32]:
plt.scatter('Physical_Activity_Hours_Per_Day', 'GPA',  data=student_df)
plt.xlabel('Physical_Activity_Hours_Per_Day')
plt.ylabel('GPA')
plt.title('Relationship between physical activity hours and grade point average.')
plt.show()
In [33]:
sns.jointplot(data=student_df,  x="Social_Hours_Per_Day", y="Physical_Activity_Hours_Per_Day", hue="Stress_Level")
Out[33]:
<seaborn.axisgrid.JointGrid at 0x207f512c4d0>
In [34]:
sns.relplot(
    data=student_df,
    x="Social_Hours_Per_Day",
    y="Physical_Activity_Hours_Per_Day",
    col="Stress_Level",
    hue="GPA",
    size="GPA",
)
Out[34]:
<seaborn.axisgrid.FacetGrid at 0x207f5b209d0>

Mainly in the low- and moderate-stress groups, social time decreases as the hours devoted to physical activity increase. In the high-stress group, however, the more time students devote to physical and social activities, the lower their GPA.

In [35]:
sns.relplot(
    data=student_df,
    kind="line",
    x="Sleep_Hours_Per_Day",
    y="Physical_Activity_Hours_Per_Day",
    style="Stress_Level",
    hue="Stress_Level",
    
)
Out[35]:
<seaborn.axisgrid.FacetGrid at 0x207f5c42450>
In [36]:
correlation_df = student_df[['Study_Hours_Per_Day','Extracurricular_Hours_Per_Day','Sleep_Hours_Per_Day', 'Social_Hours_Per_Day', 'Physical_Activity_Hours_Per_Day', 'GPA',]].corr()
In [37]:
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_df, annot=True, cmap='coolwarm', linewidths=.5)

plt.title('Heatmap showing correlations between variables')
plt.show()

There is a noticeable positive correlation between the time spent studying and the average grade.
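Reading one column of a heatmap by eye is error-prone, so it can help to list the correlations with GPA sorted by strength. The sketch below assumes a hypothetical helper `rank_correlations` and again uses only the five `head()` rows, which are far too few to draw conclusions from.

```python
import pandas as pd

# First five rows as displayed by student_df.head() in this notebook.
sample_df = pd.DataFrame({
    "Study_Hours_Per_Day": [6.9, 5.3, 5.1, 6.5, 8.1],
    "Extracurricular_Hours_Per_Day": [3.8, 3.5, 3.9, 2.1, 0.6],
    "Sleep_Hours_Per_Day": [8.7, 8.0, 9.2, 7.2, 6.5],
    "GPA": [2.99, 2.75, 2.67, 2.88, 3.51],
})


def rank_correlations(df: pd.DataFrame, target: str = "GPA") -> pd.Series:
    """Correlations of every numeric column with `target`, strongest first."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


ranked = rank_correlations(sample_df)
print(ranked)
```

On the full data the same call could be made on `student_df.drop(columns='Student_ID')`, so that the identifier does not appear in the ranking.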

In [38]:
sns.pairplot(data=student_df, hue="Stress_Level")
Out[38]:
<seaborn.axisgrid.PairGrid at 0x207f5b8de10>

Outlier analysis.

In [39]:
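# Student_ID is dropped so the identifier does not stretch the shared y-axis
student_df.drop(columns='Student_ID').groupby('Stress_Level').plot(kind='box', figsize=(20,8), grid=True)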
Out[39]:
Stress_Level
High        Axes(0.125,0.11;0.775x0.77)
Low         Axes(0.125,0.11;0.775x0.77)
Moderate    Axes(0.125,0.11;0.775x0.77)
dtype: object

Outliers appear in GPA, Study Hours and Physical Activity Hours.
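Box plots mark points beyond 1.5 times the interquartile range as outliers, and the same rule can be applied numerically. This is a sketch, with `tukey_fences` and `iqr_outliers` as hypothetical helper names. The quartiles passed in are those printed by `describe()` for the whole sample, whereas the box plots above are drawn per stress group, so the two views need not agree.

```python
import pandas as pd


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Lower and upper outlier fences from the first and third quartile."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Values of `series` lying outside its own Tukey fences."""
    low, high = tukey_fences(series.quantile(0.25), series.quantile(0.75), k)
    return series[(series < low) | (series > high)]


# Quartiles copied from the describe() output above (whole sample).
activity_low, activity_high = tukey_fences(2.4, 6.1)   # Physical_Activity_Hours_Per_Day
gpa_low, gpa_high = tukey_fences(2.9, 3.33)            # GPA
study_low, study_high = tukey_fences(6.3, 8.7)         # Study_Hours_Per_Day

print(activity_high, gpa_low, gpa_high, study_low, study_high)
```

With these quartiles, the maximum of 13 activity hours and both GPA extremes (2.24 and 4.0) fall outside the whole-sample fences, while study hours (5 to 10) stay inside them, which suggests the study-hour outliers appear only within individual stress groups. `iqr_outliers(student_df['GPA'])` would list the affected rows.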

In [40]:
plt.figure(figsize=(8, 6))
sns.boxplot(data=student_df, x='Stress_Level', y='GPA', hue='Stress_Level')


plt.title('GPA by stress level', fontsize=16)
plt.xlabel('Stress_Level', fontsize=12)
plt.ylabel('GPA', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()


plt.show()
In [41]:
plt.figure(figsize=(8, 6))
sns.boxplot(data=student_df, x='Stress_Level',  y='Physical_Activity_Hours_Per_Day', hue='Stress_Level')


plt.title('Physical activity hours per day by stress level', fontsize=16)
plt.xlabel('Stress_Level', fontsize=12)
plt.ylabel('Physical_Activity_Hours_Per_Day', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()


plt.show()
In [42]:
plt.figure(figsize=(8, 6))
sns.boxplot(data=student_df, x='Stress_Level',  y='Study_Hours_Per_Day', hue="Stress_Level" )


plt.title('Study hours per day by stress level', fontsize=16)
plt.xlabel('Stress_Level', fontsize=12)
plt.ylabel('Study_Hours_Per_Day', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()


plt.show()

Analysis summary.

GPA rises with study hours. Longer study time goes together with both higher grades and higher stress levels.

Students in the low-stress group, who also study the least, have GPAs between 2.24 and about 3.50.

We do not observe a significant relationship between extracurricular activities and grade point average.

Among low-stress students, GPA stays lower regardless of sleep amount, reaching at most 3.5. In the moderate-stress group, GPA is higher on average regardless of sleep time, but these students also study more than low-stress students.

In the high-stress group, GPAs above 3.9 appear, but only among students who study more than 8 hours a day. We do not observe a relationship between sleep and GPA there: among students who study more than 8 hours, high average grades occur with about 5 hours of sleep as well as with 9 hours.
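The claim about GPAs above 3.9 can be checked by filtering on GPA and summarizing study hours per stress level. A minimal sketch, assuming a hypothetical helper `study_hours_for_top_gpa`; the demo rows are the five `head()` rows plus the top-GPA student shown earlier in this notebook.

```python
import pandas as pd

# Rows copied from outputs shown in this notebook: head() plus the GPA 4.0 row.
rows_df = pd.DataFrame({
    "Study_Hours_Per_Day": [6.9, 5.3, 5.1, 6.5, 8.1, 9.0],
    "GPA": [2.99, 2.75, 2.67, 2.88, 3.51, 4.0],
    "Stress_Level": ["Moderate", "Low", "Low", "Moderate", "High", "High"],
})


def study_hours_for_top_gpa(df: pd.DataFrame, gpa_threshold: float = 3.9) -> pd.DataFrame:
    """Count, min and max of study hours among students above a GPA threshold."""
    top = df[df["GPA"] > gpa_threshold]
    return top.groupby("Stress_Level")["Study_Hours_Per_Day"].agg(["count", "min", "max"])


top_summary = study_hours_for_top_gpa(rows_df)
print(top_summary)
```

Run on `student_df`, a result containing only the `High` row with a minimum above 8 would support the statement above.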

As the time spent on physical activity increases, the number of hours spent on social life decreases.
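One way to make this trade-off visible without a plot is to bin physical activity and average the social hours in each bin. The sketch below assumes a hypothetical helper `binned_mean` and arbitrary bin edges of 0, 4, 8 and 13 hours; the ten rows come from the `head()` and `tail()` outputs above and only illustrate the mechanics.

```python
import pandas as pd

# Ten rows copied from the head() and tail() outputs shown in this notebook.
rows_df = pd.DataFrame({
    "Physical_Activity_Hours_Per_Day": [1.8, 3.0, 4.6, 6.5, 6.6, 7.8, 4.6, 10.8, 4.1, 2.9],
    "Social_Hours_Per_Day": [2.8, 4.2, 1.2, 1.7, 2.2, 2.1, 1.5, 0.8, 3.5, 3.1],
})


def binned_mean(df: pd.DataFrame, x: str, y: str, bins: list) -> pd.Series:
    """Mean of y within intervals of x (the lowest edge is included)."""
    x_bins = pd.cut(df[x], bins=bins, include_lowest=True)
    return df.groupby(x_bins, observed=True)[y].mean()


social_by_activity = binned_mean(
    rows_df,
    "Physical_Activity_Hours_Per_Day",
    "Social_Hours_Per_Day",
    bins=[0, 4, 8, 13],  # assumed bin edges, chosen only for illustration
)
print(social_by_activity)
```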

The question arises whether grades at university are really more important than our health and comfort of life. Can it be said that people with higher academic results but less physical and social activity are satisfied? Do people with lower stress levels but greater social and physical activity lose something by achieving a lower grade point average?